An evaluation of multi-probe locality sensitive hashing for computing similarities over web-scale query logs

Authors

Graham Cormode

Anirban Dasgupta

Amit Goyal

Chi Hoon Lee

Abstract

Many modern applications of AI such as web search, mobile browsing, image processing, and natural language processing rely on finding similar items from a large database of complex objects. Due to the very large scale of data involved (e.g., users' queries from commercial search engines), computing such near or nearest neighbors is a non-trivial task, as the computational cost grows significantly with the number of items. To address this challenge, we adopt Locality Sensitive Hashing (LSH) methods and evaluate four variants in a distributed computing environment (specifically, Hadoop). We identify several optimizations which improve performance and are suitable for deployment in very large scale settings. The experimental results demonstrate that our variants of LSH achieve robust performance with better recall compared with "vanilla" LSH, even when using the same amount of space.
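As a rough illustration of the idea the abstract refers to, the following is a minimal single-machine sketch of random-hyperplane LSH with a simple multi-probe query. It is not the authors' method or their Hadoop implementation: all function names, the choice of hyperplane hashing, and the probing order (buckets visited by increasing number of flipped signature bits) are assumptions made for this example. Published multi-probe schemes typically rank probes by estimated success likelihood instead.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes in dim dimensions (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def build_index(items, hyperplanes):
    """Map each signature (bucket) to the keys of the items hashed into it."""
    index = defaultdict(list)
    for key, vec in items.items():
        index[signature(vec, hyperplanes)].append(key)
    return index


def probe_sequence(sig, max_flips):
    """Yield the home bucket, then buckets that differ in 1..max_flips bits."""
    yield sig
    for k in range(1, max_flips + 1):
        for positions in combinations(range(len(sig)), k):
            flipped = list(sig)
            for p in positions:
                flipped[p] ^= 1
            yield tuple(flipped)


def query(vec, index, hyperplanes, max_flips=1):
    """Collect candidate keys from the home bucket and its nearby buckets."""
    sig = signature(vec, hyperplanes)
    candidates = []
    for bucket in probe_sequence(sig, max_flips):
        candidates.extend(index.get(bucket, []))
    return candidates
```

Probing neighboring buckets in a single table is what lets a multi-probe scheme trade a few extra lookups for recall that "vanilla" LSH would otherwise need additional hash tables, and therefore additional space, to reach. Candidates returned by `query` would still need to be re-ranked with the true similarity measure.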

Similar resources

Multi-Level Spherical Locality Sensitive Hashing For Approximate Near Neighbors

This paper introduces “Multi-Level Spherical LSH”: a parameter-free, multi-level, data-dependent Locality Sensitive Hashing data structure for solving the Approximate Near Neighbors Problem (ANN). This data structure uses a modified version of a multi-probe adaptive querying algorithm, with the potential of achieving an O(np + t) query run time, for all inputs n where t <= n. Keywords—Locality Sensitiv...

Intelligent Probing for Locality Sensitive Hashing: Multi-Probe LSH and Beyond

The past decade has been marked by the (continued) explosion of diverse data content and the fast development of intelligent data analytics techniques. One problem we identified in the mid-2000s was similarity search of feature-rich data. The challenge here was achieving both high accuracy and high efficiency in high-dimensional spaces. Locality sensitive hashing (LSH), which uses certain rando...

High-Throughput, Web-Scale Data Stream Clustering

Clustering is an important technique for analysing and interpreting massive quantities of data present on the web. However, the sheer volume of data, along with its often dynamic and fast-changing nature, poses a challenge for traditional clustering approaches. We present a parallel clustering system specifically designed for continuous, real-time clustering of web-scale message data streams. A...

Robust and Efficient Locality Sensitive Hashing for Nearest Neighbor Search in Large Data Sets

Locality sensitive hashing (LSH) has been used extensively as a basis for many data retrieval applications. However, previous approaches, such as random projection and multi-probe hashing, may exhibit high query complexity of up to Θ(n) when the underlying data distribution is highly skewed. This is due to the imbalance in the number of data items stored per bucket, which leads to slow query tim...

Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search

Similarity indices for high-dimensional data are very desirable for building content-based search systems for feature-rich data such as audio, images, videos, and other sensor data. Recently, locality sensitive hashing (LSH) and its variations have been proposed as indexing techniques for approximate similarity search. A significant drawback of these approaches is the requirement for a large num...

Volume: 13

Publication date: 2018
